Approximate String Joins in a Database (Almost) for Free: Erratum
Abstract
In [GIJ01a, GIJ01b] we described how to use q-grams in an RDBMS to perform approximate string joins. We also showed how to implement the approximate join using plain SQL queries. Specifically, we described three filters (the count filter, the position filter, and the length filter) that can be used to execute the approximate join efficiently. The intuition behind the count filter was that similar strings have many q-grams in common. In particular, two strings s1 and s2 can have up to max{|s1|, |s2|} + q − 1 common q-grams. When s1 = s2, they have exactly that many q-grams in common. When s1 and s2 are within edit distance k, they share at least (max{|s1|, |s2|} + q − 1) − kq q-grams, since kq is the maximum number of q-grams that can be affected by k edit operations. We implemented the count filter in the HAVING clause of the SQL statement in Figure 1. String pairs without enough q-grams in common are filtered out of the result.

Unfortunately, this implementation of the count filter is problematic when kq is greater than or equal to max{|s1|, |s2|} + q − 1. In this case the lower bound on shared q-grams is zero or negative, so two strings can be within edit distance k and still not share any q-grams. The SQL statement in Figure 1 will then fail to identify s1 and s2 as being within edit distance k, since there are no q-grams from this string pair to join and count. Hence, in this case the result returned by the Figure 1 query is incomplete and suffers from “false negatives,” contrary to our claim in [GIJ01a, GIJ01b].

In general, the string pairs that are omitted are pairs of short strings. Even when these strings match within a small edit distance, the match tends to be meaningless (e.g., “IBM” matches “ACM” within edit distance 2). However, when it is absolutely necessary to have no false negatives, we can modify the SQL query in Figure 1 so that it produces the correct results.
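The failure case can be illustrated with a small Python sketch. It is not the SQL of Figure 1; the function names are hypothetical, and the padding of each string with q − 1 leading '#' and trailing '$' characters is an assumption chosen so that a string of length n yields n + q − 1 q-grams, matching the counts used above.

```python
from collections import Counter


def qgrams(s, q):
    # Assumed padding: q - 1 '#' in front and q - 1 '$' behind,
    # so a string of length n yields n + q - 1 q-grams.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(s) + q - 1)]


def common_qgrams(s1, s2, q):
    # Size of the multiset intersection of the two q-gram bags.
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


def count_threshold(s1, s2, k, q):
    # Lower bound on shared q-grams for strings within edit distance k.
    return max(len(s1), len(s2)) + q - 1 - k * q


if __name__ == "__main__":
    # Longer strings: the bound is positive and the pair shares q-grams.
    print(common_qgrams("database", "databases", 2),
          count_threshold("database", "databases", 2, 2))  # 8 6
    # Short strings: edit distance 2, yet no shared q-grams and a
    # non-positive bound, so a q-gram join never sees this pair.
    print(common_qgrams("ab", "cd", 2),
          count_threshold("ab", "cd", 2, 2))  # 0 -1
```

In the second call, "ab" and "cd" are within edit distance 2 (two substitutions), but a join on q-grams produces no rows for the pair, so no HAVING condition is ever evaluated for it.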
Since the false negatives are only pairs of short strings, we can join all pairs of these short strings using only the length filter, and UNION the result with the result of the SQL query described in [GIJ01a, GIJ01b]. We list the modified query in Figure 2.
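The idea of the fix can be sketched in Python as follows. This is not the query of Figure 2, which is not reproduced here. All function names are hypothetical, the '#'/'$' padding is an assumption, and the cutoff for "short" strings (length at most kq − q + 1) is derived from the condition kq ≥ max{|s1|, |s2|} + q − 1 rather than taken from the original query.

```python
from collections import Counter
from itertools import product


def qgrams(s, q):
    # Assumed padding so that a string of length n has n + q - 1 q-grams.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(s) + q - 1)]


def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def qgram_join(r1, r2, k, q):
    # Stand-in for the original query: only pairs that share at least
    # one q-gram are ever considered, as in a join on q-grams.
    out = set()
    for s1, s2 in product(r1, r2):
        shared = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
        if shared == 0:
            continue  # no joined rows, so the pair is silently dropped
        if abs(len(s1) - len(s2)) > k:
            continue  # length filter
        if shared < max(len(s1), len(s2)) + q - 1 - k * q:
            continue  # count filter
        if edit_distance(s1, s2) <= k:
            out.add((s1, s2))
    return out


def short_string_join(r1, r2, k, q):
    # Pairs of short strings, pruned with the length filter only.
    limit = k * q - q + 1  # assumed cutoff: len + q - 1 <= k * q
    out = set()
    for s1, s2 in product(r1, r2):
        if len(s1) > limit or len(s2) > limit:
            continue
        if abs(len(s1) - len(s2)) > k:
            continue  # length filter
        if edit_distance(s1, s2) <= k:
            out.add((s1, s2))
    return out


def corrected_join(r1, r2, k, q):
    # Set union plays the role of SQL UNION (duplicates removed).
    return qgram_join(r1, r2, k, q) | short_string_join(r1, r2, k, q)
```

For example, with r1 = ["ab", "database"], r2 = ["cd", "databases"], k = 2 and q = 2, `qgram_join` returns only ("database", "databases"), while `corrected_join` also returns ("ab", "cd").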
Similar resources
Approximate String Joins in a Database (Almost) for Free
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not suppo...
Approximate String Joins
String data is ubiquitous and is commonly used to correlate (or join) entities across autonomous, heterogeneous databases. The main challenge is to effectively deal with the noisy nature of string data, due to, for example, transcription errors, incomplete information, and multiple conventions for recording string valued attributes. Commercial databases do not support approximate string joins d...
Supporting Similarity Operations Based on Approximate String Matching on the Web
Querying and integrating sources of structured data from the Web in most cases requires similarity-based concepts to deal with data level conflicts. This is due to the often erroneous and imprecise nature of the data and diverging conventions for their representation. On the other hand, Web databases offer only limited interfaces and almost no support for similarity queries. The approach presen...
Using q-grams in a DBMS for Approximate String Processing
String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not support approximate string queries directly, and it is a ...
Convergent Inference with Leaky Joins
Over the past decade, a class of model database engines like BayesStore, MauveDB, and numerous others have emerged, allowing users to interact with probabilistic graphical models through queries. A key task for model databases, computing marginal probabilities grows exponentially in the complexity of the graph. Although exact solutions are feasible for smaller graphs, for larger graphs approxim...
Publication date: 2003